DOCUMENT RESUME

ED 164 610                                          TM 008 178

AUTHOR         Merz, William R.; Rudner, Lawrence M.
TITLE          Bias in Testing: A Presentation of Selected
               Methods.
PUB DATE       Mar 78
NOTE           30p.; Paper presented at the Annual Meeting of the
               American Educational Research Association (62nd,
               Toronto, Ontario, Canada, March 27-31, 1978)

EDRS PRICE     MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS    Analysis of Covariance; Analysis of Variance;
               Complexity Level; *Evaluation Criteria; *Evaluation
               Methods; Factor Analysis; Item Analysis; Predictive
               Validity; *Predictor Variables; Scores; *Test Bias;
               *Test Items; Test Selection
IDENTIFIERS    Chi Square; Item Characteristic Curve Theory; Item
               Discrimination (Tests)

ABSTRACT

     A variety of terms related to test bias or test fairness
have been used in a variety of ways, but in this document the
"fair use of tests" is defined as equitable selection procedures
by means of intact tests, and "test item bias" refers to the
study of separate items with respect to the tests of which they
are a part. Seven different operational definitions of the fair
use of tests are described; distinctions are made between those
applied to fairness for individuals who are members of special
groups, and those applied to fairness for groups but not their
individual members. All seven approaches use the regression
model. One method also requires the use of expected utilities.
Various methods are described for investigating test item bias.
Both classical test theory item analysis and latent trait item
characteristic curve approaches are mentioned. Tests for bias
included analysis of variance, chi-square, factor analysis, and
arbitrary confidence bands. Distractor responses and item-test
point-biserial correlations have also been considered. All of
these methods are described briefly, but no attempt was made to
evaluate them. (CTM)



***********************************************************************
*  Reproductions supplied by EDRS are the best that can be made      *
*  from the original document.                                       *
***********************************************************************



Bias in Testing: A Presentation of
Selected Methods



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY



William R. Merz, Ph.D. 
Associate Professor, Department of 
Behavioral Sciences in Education 
California State University, Sacramento 

and

Lawrence M. Rudner, Ph.D.
Research Specialist, Model Secondary School for the Deaf
Gallaudet College



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC) AND
USERS OF THE ERIC SYSTEM."



A paper presented as part of a Co-Sponsored Symposium,
Empirical Evaluations of Various Approaches for Identifying Biased
Test Item(s), at the Annual Meeting of the American Educational
Research Association and the National Council for Measurement in
Education, Toronto, Canada, March 1978.



Printed in U.S.A. 



The issue of collecting fair information on the performance
of members of identifiable groups is a major problem in the
construction and use of tests in schools, government, and industry
in the United States of America. Interest in this issue has
broadened beyond the boundaries of the United States to other
countries with multiethnic and multilingual populations. These
issues have been addressed under various labels; however, the two
most frequently used are test fairness and test bias. The two
terms often have been used synonymously. It is the contention
of the present reviewers (Merz, 1976; Rudner, 1977c) that
synonymous use has led to major confusion about methods and loss
of focus on the questions being addressed.

One set of methodologies is used for situations in which
an intact test is administered to members of different groups,
and these groups obtain different mean total scores. Under this
condition the usual goal of measurement is to provide data for
selection of applicants. The issue is using the test to predict
later success in a fair and equitable manner. Here, one
of the approaches to regression analysis is applied with the
test of interest as a predictor and some external measure of
success as the criterion.



Another set of methodologies involves the identification of
items which systematically differentiate among members of a group.
To date these methodologies have not used an external criterion of
success and have focused on single items from a pool which
constitutes or will constitute the intact test. The focus of these
efforts is usually the construction of a measure which assesses a
content domain without introducing systematic variance attributable
to factors other than those which are the intended object of
measurement.

The purpose of this paper is to describe the methods used
to investigate the presence of what has been labeled as "test
bias." In order to make a clear distinction between the two
methodologies, the examination of intact tests for equitable
selection will be treated under the topic Fair Use of Tests,
while examination of items within a test or an item pool for
systematic performance differences among groups will be described
under the heading Test Item Bias.

Fair Use of Tests

This type of investigation is of interest to test users
who need to know the accuracy of test information. Seven
approaches to determining fairness in selection are reviewed here.
The seven are all regression approaches; that is, they attempt
to predict from a selection or placement instrument to a
criterion of success. Each uses a correlation-prediction model, but
each differs in the way the criterion cut-off score is adjusted
to yield fair estimates of success. Thus, each method assumes
that there is a valid, reliable, and unbiased criterion measure
for members of a given group. Further, the other assumptions of
regression models also pertain: bivariate normality and
homogeneity. The first assumption is absolutely necessary; if the
criterion is not valid, reliable, and unbiased, then the
prediction method fails. The assumptions of normality and
homogeneity systematically affect the magnitude of the correlation
coefficient and, thereby, influence the accuracy of prediction.
Of course, a sufficient number of examinees must be
available to compute stable correlations, or the method is
unreliable.

The first method, labeled the regression model by Petersen
and Novick (1976), was described by Cleary (1968). It defines
a test as fair if there are no consistent non-zero errors of
prediction for members of each subgroup of the population. This
relationship is described by this equation:

$$y^* = \alpha_i + \beta_i x_i^*$$

where $y^*$ represents the single criterion cut-off score,
      $\alpha_i$ represents the intercept,
      $\beta_i$ represents the slope, and
      $x_i^*$ represents the predictor pass score for
      subpopulation $\pi_i$ $(i = 1, \ldots, g)$.

Here a different regression equation is calculated for each
subgroup. Corrections are made because of differences in mean
values of X and Y among subgroups. However, only one acceptable
criterion score is used. Hence, Darlington (1971) views this
situation as $r_{cx} = r_{cy}/r_{xy}$; that is, the correlation between
group membership and the predictor is equal to the ratio of the
correlation between group membership and the criterion to the
correlation between predictor and criterion. The focus of this
method is on fairness to the individual rather than on fairness
to the group. It is the most widely used approach to fair
selection.
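The regression model can be illustrated with a minimal sketch, assuming ordinary least squares is fitted separately within each subgroup and the pass score is obtained by solving $y^* = \alpha_i + \beta_i x_i^*$ for $x_i^*$. The function names and the data are hypothetical and are not taken from Cleary (1968) or Petersen and Novick (1976).

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept (alpha) and slope (beta) for one subgroup."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

def pass_score(x, y, y_star):
    """Predictor pass score x* that solves y* = alpha + beta * x*."""
    alpha, beta = fit_line(x, y)
    return (y_star - alpha) / beta

# Hypothetical (predictor, criterion) scores for two subgroups.
group_1 = ([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
group_2 = ([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
Y_STAR = 8.0  # one criterion cut-off shared by every subgroup

cutoffs = [pass_score(x, y, Y_STAR) for x, y in (group_1, group_2)]
print(cutoffs)
```

In this made-up example the two subgroups share a slope but differ in intercept, so the same criterion cut-off maps to different predictor pass scores.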



The second method was described by Thorndike (1971) and
developed by Cole (1973); it was called the constant ratio
model by Petersen and Novick (1976). A test is fair if it
identifies applicants for selection in such a way that the ratio
of the proportion selected to the proportion successful is the
same in all subpopulations. Here the relationship may be
described as:

$$R = \frac{\mathrm{Prob}(X \ge x_1^* \mid \pi_1)}{\mathrm{Prob}(Y \ge y^* \mid \pi_1)} = \cdots = \frac{\mathrm{Prob}(X \ge x_g^* \mid \pi_g)}{\mathrm{Prob}(Y \ge y^* \mid \pi_g)}$$

where R is a fixed constant for subpopulations and
      $x_i^*$ represents the predictor cut-off score for subpopulation
      $\pi_i$ $(i = 1, \ldots, g)$.

This method focuses on fairness to the group rather than on
fairness to the individual. It requires, in addition to the
general assumptions listed earlier, that a constant ratio of
success is reasonable in all subgroups; here, $r_{cx} = r_{cy}$
(Darlington, 1971). This approach is used where equity between
groups is the central consideration and in situations where the
difference between the means for the predictor differs from the
difference between the means for the criterion. Emphasis is on
false successful and false unsuccessful predictions, as well as on
accurate predictions (Petersen and Novick, 1976).
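A small sketch of the constant ratio check follows, assuming the two probabilities are estimated by simple sample proportions. The function name, the cut-offs, and the scores are hypothetical.

```python
def selection_success_ratio(x, y, x_cut, y_cut):
    """Proportion selected divided by proportion successful in one subgroup."""
    selected = sum(xi >= x_cut for xi in x) / len(x)
    successful = sum(yi >= y_cut for yi in y) / len(y)
    return selected / successful

# Hypothetical predictor (x) and criterion (y) scores for two subgroups.
group_a_x = list(range(1, 11))
group_a_y = list(range(1, 11))
group_b_x = list(range(1, 21))
group_b_y = [6] * 5 + [0] * 15

Y_CUT = 6  # common criterion cut-off
ratio_a = selection_success_ratio(group_a_x, group_a_y, x_cut=7, y_cut=Y_CUT)
ratio_b = selection_success_ratio(group_b_x, group_b_y, x_cut=17, y_cut=Y_CUT)
print(ratio_a, ratio_b)
```

Under the constant ratio definition the predictor cut-offs would be chosen so that the two ratios agree, as they do for these illustrative values.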
A third approach was proposed by Einhorn and Bass (1971);
it was labeled the equal risk model by Petersen and Novick
(1976). It defines a test as fair when all persons selected
are predicted to be above a specific minimum point on the
criterion with a specified degree of confidence. In this case,

$$Z = \mathrm{Prob}(Y \ge y^* \mid X = x_1^*, \pi_1) = \cdots = \mathrm{Prob}(Y \ge y^* \mid X = x_g^*, \pi_g)$$

where Z is a fixed constant probability for all
      subpopulations $\pi_i$ $(i = 1, \ldots, g)$,
      $x_i^*$ represents the predictor cut-off score for a
      subpopulation, and
      $y^*$ represents the criterion cut-off score.

This is accomplished by adjusting the criterion passing
score with a confidence band, so that

$$y' = Y_c - Z_p \, s_{y \cdot x}$$

where $Z_p$ is a z-score which can be designated by the
      desired degree of risk,
      $Y_c$ is the criterion cut-off, and
      $s_{y \cdot x}$ is the standard error of estimate.

This approach allows separate cut-off points for each
subpopulation if the standard errors for subpopulations are different.
Where standard errors for each subpopulation are equal, the
results reduce to the situation described for the regression
model. It, too, focuses on the group rather than the individual.
Other assumptions are similar to the regression model. In
addition, it must be logical to expect the probability of success
in each group to be equal.
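The equal risk condition can be sketched directly from the probability statement, assuming the criterion is normally distributed around the regression prediction with standard deviation equal to the standard error of estimate. The function names and the numerical values are hypothetical.

```python
from statistics import NormalDist

def prob_success(alpha, beta, s_yx, y_star, x):
    """Prob(Y >= y* | X = x) under a normal error model (an assumption)."""
    y_hat = alpha + beta * x
    return 1.0 - NormalDist(mu=y_hat, sigma=s_yx).cdf(y_star)

def equal_risk_cutoff(alpha, beta, s_yx, y_star, z_prob):
    """Predictor cut-off at which Prob(Y >= y* | X = x) equals z_prob."""
    z = NormalDist().inv_cdf(z_prob)
    return (y_star + z * s_yx - alpha) / beta

Y_STAR, Z_PROB = 8.0, 0.80  # hypothetical criterion cut-off and confidence
# Same regression line, different standard errors of estimate.
cut_1 = equal_risk_cutoff(alpha=0.0, beta=2.0, s_yx=1.0, y_star=Y_STAR, z_prob=Z_PROB)
cut_2 = equal_risk_cutoff(alpha=0.0, beta=2.0, s_yx=2.0, y_star=Y_STAR, z_prob=Z_PROB)
print(cut_1, cut_2)
```

With identical regression lines, the subgroup with the larger standard error of estimate receives the higher predictor cut-off; with equal standard errors the two cut-offs coincide.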

Darlington (1971) suggested a model which would replace the
concept of cultural fairness with another which he labels cultural
optimality; hence, it was called the culture modified criterion
approach by Petersen and Novick (1976). Darlington defines a
test as culturally optimal when: (1) a subjective policy level
question is answered concerning the optimum balance between
performance and cultural factors, and (2) an empirical
relationship between a test and a culture modified variable (Y - kC)
is established.

Here, $$(Y - kC) = \alpha_i + \beta_i x_i^*$$

where Y is the criterion,
      k is a constant subjective value judgment on
      the part of the decision maker,
      C denotes an applicant's group membership,
      $\alpha_i$ is the intercept,
      $\beta_i$ is the slope, and
      $x_i^*$ is the predictor pass score for subpopulations
      $\pi_i$ $(i = 1, \ldots, g)$.

The criterion score is adjusted by a predetermined amount
based upon group membership. Here, in addition to the other
assumptions inherent in a regression approach, one must see some
value in selecting members of a subpopulation, and that value
must be translated into the constant which adjusts the criterion.
The process of adjusting criterion scores is open and may be
debated publicly.

Cole (1973) proposed a fifth method, labeled the conditional
probability model by Petersen and Novick (1976). In this model,
a test is regarded as fair if, given satisfactory criterion
performance, individuals have the same probability of selection
regardless of group membership.

Here, $$K = \mathrm{Prob}(X \ge x_1^* \mid Y \ge y^*, \pi_1) = \cdots = \mathrm{Prob}(X \ge x_g^* \mid Y \ge y^*, \pi_g)$$

where K is a fixed constant for subpopulations
      $\pi_i$ $(i = 1, \ldots, g)$ and
      $x_i^*$ represents the predictor pass score for
      subpopulation $\pi_i$.

This model looks for equity to the group. The emphasis is on
false unsuccessful predictions as well as on accurate predictions
(Petersen and Novick, 1976). In addition to the other assumptions
mentioned earlier, it must be reasonable to expect that all groups
perform equally well on the criterion.

The sixth model was proposed by Linn (1973) and defines a
test as fair if all applicants who are selected are guaranteed
an equal, or fair, chance of being successful regardless of group
membership. This model was labeled the equal probability model
by Petersen and Novick (1976).

Here, $$Q = \mathrm{Prob}(Y \ge y^* \mid X \ge x_1^*, \pi_1) = \cdots = \mathrm{Prob}(Y \ge y^* \mid X \ge x_g^*, \pi_g)$$

where Q is a fixed constant for all subpopulations
      $\pi_i$ $(i = 1, \ldots, g)$ and
      $x_i^*$ represents the predictor cut-off score for
      subpopulation $\pi_i$.

It, too, seeks equity for the group. It emphasizes false
unsuccessful predictions as well as accurate predictions (Petersen
and Novick, 1976). In addition to the other assumptions mentioned
earlier, it must be reasonable to expect all groups to perform
equally well on the predictor.



The last model to be reviewed was proposed by Gross and Su
(1975) and was labeled the threshold utility model by Petersen
and Novick (1976). It states that a test is fair if an individual
from a subpopulation is selected when his/her predicted
score reaches a specific minimum point on the criterion which
has been modified in such a way that the expected utility of
the selection process is a maximum. Here the utility of the
selection process is found with

$$E[u(O)] = \sum_{i=1}^{2} P_i \sum_{j=1}^{4} u(O_j \mid \pi_i)\, \mathrm{Prob}(O_j \mid \pi_i)$$

where $P_i$ is the proportion of the combined applicant population
($\pi_1$ and $\pi_2$) who are members of the subpopulation $\pi_i$.
This assumes that four outcomes are possible:

   $O_1$: $X \ge x_i^*$, $Y \ge y^*$   An applicant is accepted and is successful
   $O_2$: $X < x_i^*$, $Y \ge y^*$     An applicant is rejected but would have
                                       been successful
   $O_3$: $X < x_i^*$, $Y < y^*$       An applicant is rejected and would have
                                       been unsuccessful
   $O_4$: $X \ge x_i^*$, $Y < y^*$     An applicant is accepted and is unsuccessful
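The expected utility sum over subpopulations and outcomes can be written out as a short sketch. It only evaluates the sum for one fixed set of cut-offs; the search for the cut-offs that maximize it is not shown. The function name and all utilities, proportions, and outcome probabilities are hypothetical.

```python
def expected_utility(proportions, utilities, outcome_probs):
    """Sum over subpopulations i and outcomes j of P_i * u(O_j|i) * Prob(O_j|i)."""
    return sum(
        p_i * sum(u * pr for u, pr in zip(u_i, pr_i))
        for p_i, u_i, pr_i in zip(proportions, utilities, outcome_probs)
    )

# Hypothetical values; outcomes are ordered O1..O4 as in the list above.
P = [0.6, 0.4]                      # share of each subpopulation
U = [[1.0, -0.5, 0.0, -1.0],        # utilities stated for subpopulation 1
     [1.5, -0.5, 0.0, -1.0]]        # utilities stated for subpopulation 2
PROB = [[0.4, 0.1, 0.3, 0.2],       # outcome probabilities, subpopulation 1
        [0.3, 0.2, 0.4, 0.1]]       # outcome probabilities, subpopulation 2

value = expected_utility(P, U, PROB)
print(round(value, 4))
```

In practice the outcome probabilities depend on the chosen predictor cut-offs, so the function would be re-evaluated for each candidate pair of cut-offs.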

Utilities usually differ for each subpopulation, and
regression equations may differ, too. The method escapes the
difficulty of emphasizing false successful and false unsuccessful
predictions by seeking public statement of utilities and,
then, maximizing the likelihood of that utility for a given



Test Item Bias

This type of investigation is of interest to test developers
because it assists them in devising valid, cross-culture fair
items and provides a framework for constructing better tests in
subsequent efforts. Six approaches are reviewed here.

Analysis of Variance Approaches

Cardall and Coffman (1964) suggested a method of identifying
bias using an analysis of variance framework which incorporates
test items and group membership as main effects. Bias is
defined as a significant item by group interaction, that is,
the presence of items which are relatively more difficult for
members of one culture group than another. In order to meet
the homogeneity of variance assumption of the analysis of
variance, Cardall and Coffman transformed the within group item
difficulties with an arcsin transform.

Plake and Hoover (1977) extended the technique to allow
for the identification of individual items. An interaction
contrast is formed for each item within each group from
$\theta_{ij}$, the arcsin-transformed item difficulty for the
i-th item and the j-th group. The error variance of the contrast
is a function of q, the number of items, and of the harmonic
mean of the number of subjects in the J groups.

A simultaneous significance test, such as Bonferroni's procedure,
can then be used to identify individual items which appear to
be biased.
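As an illustration only, the sketch below computes the usual two-way interaction contrast (cell value minus item mean minus group mean plus grand mean) on arcsin-transformed item difficulties. Both the contrast and the particular transform, arcsin of the square root of p, are assumptions chosen for the example and are not confirmed as the exact forms used by Plake and Hoover (1977). The data are hypothetical.

```python
import math

def arcsin_transform(p):
    """Variance-stabilizing transform of a proportion (assumed form)."""
    return math.asin(math.sqrt(p))

def interaction_contrasts(p_values):
    """p_values[i][j] is the proportion correct on item i in group j."""
    theta = [[arcsin_transform(p) for p in row] for row in p_values]
    n_items, n_groups = len(theta), len(theta[0])
    item_means = [sum(row) / n_groups for row in theta]
    group_means = [sum(theta[i][j] for i in range(n_items)) / n_items
                   for j in range(n_groups)]
    grand = sum(item_means) / n_items
    return [[theta[i][j] - item_means[i] - group_means[j] + grand
             for j in range(n_groups)]
            for i in range(n_items)]

# Hypothetical item difficulties: three items, two groups.
p = [[0.8, 0.7],
     [0.6, 0.5],
     [0.7, 0.3]]
contrasts = interaction_contrasts(p)
largest = max(range(len(p)), key=lambda i: abs(contrasts[i][0]))
print(largest)
```

Each contrast would then be compared with its standard error under a simultaneous procedure such as Bonferroni's; that step is omitted here.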

Cleary and Hilton (1968) employed a three-factor, mixed
model analysis of variance to determine whether or not items
within a test were biased, again defining bias as item by group
interaction. In their analysis, race and socioeconomic status
were considered fixed variables, while persons and items were
considered random. Socioeconomic status levels were nested
within race to avoid assuming that the levels were comparable
across the races.

Other examples of this approach can be found in Eagle and
Harris (1969), Hoepfner and Strickland (1972), and Jensen (1973).
It should be noted that these authors and Cardall and Coffman
did not incorporate an arcsin transform.

" /''—'.' . ' ■ • 

Transformed Item Difficulties

The transformed item difficulties approach, providing for a
visual examination of item by group interaction effects, was
probably first described by Thurstone (1925) in connection with his
method of absolute scaling. Of the approaches, this method
appears to be one of the best known. It has been advocated and used
frequently by Angoff (1972), Angoff and Ford (1974), Angoff
and Modu (1973), and others (Green and Draper, 1972; Jensen, 1973;
Hicks, et al., 1976; Strassberg-Rosenberg and Donlon, 1975;
Echternacht, 1974; and Rudner, 1978). Further, the approach has
appeared in at least one measurement textbook (Anastasi, 1976,
pp. 222-226).

     In this method, indices of item difficulty, i.e.,
     p-values, are obtained for two different groups on
     a number of items. Each p-value is converted to a
     normal deviate and the pairs of normal deviates, one
     pair for each item, are plotted on a bivariate graph,
     each pair represented by a point on the graph.
     (Angoff, 1972, p. 1)
■ ■ ■ ' ■ ' " ' ■ . ■ ■ I ' ' \ ■■ . 

The plot will generally be in the form of an ellipse. A
45° line, passing through the origin, provides a theoretical
regression indicating the absence of bias. Items greatly
deviating from this line may be regarded as exhibiting an item
by group interaction. Relative to the other items, deviant
items are especially more difficult for members of one group
than the other. Assuming both groups received similar
instructions, such items would appear to represent different
psychological meanings for the two groups of examinees.

Since the intent is to make comparisons of between-group
differences in item difficulty, it is necessary to transform
the proportion passing an item to an index of item difficulty
which constitutes at least an interval scale. This is accomplished
by expressing each item p-value in terms of within-group
deviations of a normal curve (see Guilford, 1954, pp. 418-419).
Any linear transformation of the item z-score will meet such a
requirement. One such transformation has been Delta values
(4z + 13).

The distance of an item point to the line,

$$d = (z_1 - z_2)/\sqrt{2}$$

where $z_j$ is the transformed item difficulty for
group j, serves to indicate the degree of item bias. Items
which are "greatly deviating" from the line are identified by a
traditional or nontraditional method of outlier or residual
analysis. One method is to place confidence limits on the line
by using a multiple of the standard error of estimation. An
alternate approach, adopted by Strassberg-Rosenberg and Donlon
(1975) and Hicks, et al. (1976), involves computing the standard
deviation of the residuals and classifying as biased those items
deviating by greater than 1.5 standard deviation units. Rudner
(1978) has employed a fixed item-regression line distance of
.75 z-score units.
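The transformation and distance computation can be sketched as follows. The sketch assumes the normal deviate is taken so that harder items receive larger values (z is the inverse normal of 1 - p), measures distance from the 45° line through the origin, and borrows the fixed .75 threshold mentioned above; function names and p-values are hypothetical.

```python
import math
from statistics import NormalDist

def item_z(p):
    """Normal deviate for proportion passing p (harder item, larger z)."""
    return NormalDist().inv_cdf(1.0 - p)

def delta(p):
    """Delta value, the linear transformation 4z + 13."""
    return 4.0 * item_z(p) + 13.0

def distance_from_line(p1, p2):
    """Signed distance of the point (z1, z2) from the 45-degree line."""
    return (item_z(p1) - item_z(p2)) / math.sqrt(2.0)

def flag_items(p_group1, p_group2, threshold=0.75):
    """Indices of items whose distance from the line exceeds the threshold."""
    return [i for i, (p1, p2) in enumerate(zip(p_group1, p_group2))
            if abs(distance_from_line(p1, p2)) > threshold]

# Hypothetical p-values for three items in two groups.
flagged = flag_items([0.80, 0.50, 0.90], [0.78, 0.48, 0.30])
print(flagged)
```

Only the third item, which is much harder for the second group than the other items would suggest, lies far from the line in this example.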

Echternacht (1974) also began with item difficulties which
were transformed to delta values. Differences in transformed
item difficulties were computed for each pair of groups, and
these differences were plotted on normal probability paper.
Additionally, a line was plotted to represent a hypothetical
normal distribution with the obtained mean and standard
deviation of the difference between pairs as parameters. Confidence
bands constructed around this line represent the area outside
of which biased items would fall.

Correlation Approaches

These approaches examine the point biserial correlation
coefficients between item performance and total score. Ozenne,
Van Gelder, and Cohen (1974) coupled a graphing method with the
point biserial correlation approach. First, item difficulty
levels were plotted using one group as reference against which
the other groups were plotted. Items were arranged in order
of difficulty for the reference group from most difficult to
least difficult. Item numbers were plotted along the ordinate,
and item difficulty along the abscissa. In this case a
publisher's national standardization sample was used as the
reference group against which a minority sample was plotted.
Visual examination of the plots revealed item by group
interactions when the uniformity of the shapes of the curves was
disturbed. The magnitudes of differences were not the concern;
rather the deviation from the shape of the reference curve was
noted. Then, point biserial correlations between item scores
and total score were computed for each group that was to be
compared. Correlations were compared to identify items which for
a particular group did not contribute to total score; that is,
items with a low item-total score correlation for a specific
group were examined for bias. Items were identified as
potentially biased by expert judgment based on the results of the
two methods of analysis.

Green's strategy was used in standardizing the Comprehensive
Tests of Basic Skills, Form S (Green, 1976; CTB/McGraw-Hill,
1974). Again, point biserial correlations were computed for
each group on each item; any item having a correlation of less
than .20 for any group was deleted. Green offered as evidence
for the effectiveness of this strategy that fewer point biserial
correlations fell below .20 for blacks in the standardization
data.
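A minimal sketch of the item-total screening follows. It assumes dichotomously scored items, uses the total score including the item itself (no correction for overlap), and applies the .20 cut-off within each group; the function names and response data are hypothetical.

```python
from statistics import mean, pstdev

def point_biserial(item, total):
    """Point biserial correlation between 0/1 item scores and total scores."""
    p = mean(item)
    sd = pstdev(total)
    if p == 0 or p == 1 or sd == 0:
        return 0.0  # correlation undefined; treated as zero in this sketch
    mean_right = mean(t for u, t in zip(item, total) if u == 1)
    mean_wrong = mean(t for u, t in zip(item, total) if u == 0)
    return (mean_right - mean_wrong) / sd * (p * (1 - p)) ** 0.5

def low_correlation_items(responses_by_group, threshold=0.20):
    """Items whose item-total correlation is below threshold in any group."""
    flagged = set()
    for rows in responses_by_group.values():
        totals = [sum(row) for row in rows]
        for i in range(len(rows[0])):
            if point_biserial([row[i] for row in rows], totals) < threshold:
                flagged.add(i)
    return sorted(flagged)

# Hypothetical responses: one row per examinee, one column per item.
data = {
    "reference": [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]],
    "focal":     [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]],
}
print(low_correlation_items(data))
```

Here the third item correlates negatively with total score in the second group and is the only item flagged.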

Factor Analytic Approaches

In factor analysis, underlying factors (i.e., dimensions
or traits) are hypothesized and the correlations of each
variable with the hypothesized factors are computed. In an
achievement test, each item is treated as a variable. Such an
analysis could be conducted twice using examinees from two
different cultural backgrounds. Ideally, the two separate groups
of examinees would yield similar sets of item-trait correlations
(factor loadings). Different sets of factor loadings would
indicate that the two groups are not responding to the items in
the same manner. Such a test would be considered biased in
that it would appear to measure different traits across groups.
The items exhibiting the most bias would then be those with the
largest differences in factor loading.

The general model for this type of factor analysis is

$$x = \Lambda f + e$$

where x is a vector of subject responses,
      $\Lambda$ is a matrix of factor loadings,
      f is a vector of factor variables (locations), and
      e is a vector of residual or error terms.

From values of x, $\Lambda$, f, and e are determined.

Green and Draper (1972) and Green (1976) suggest an
inter-group factor analysis model based on the inter-battery factor
analysis approach offered by Tucker (1958). In this inter-group
model, the item variance is partitioned into: (1) factors
common to each subgroup; (2) factors specific to subgroups; and
(3) residual or error variance. With this model one can
determine the proportion of item variance accounted for by a given
subgroup. An item, then, is unbiased when this proportion is
small and biased if a large proportion of variance is
attributable to culture-specific sources.

Merz (1973, 1976) developed an alternate approach which
incorporates factor scores and analysis of variance. The item
intercorrelation matrix is computed for subjects pooled across
groups. The matrix is reduced with Principal Components
Analysis, employing the Scree technique to determine the number
of factors to be extracted. The factor matrix is then rotated
orthogonally to simple structure, and factor scores are derived
from the rotated matrix.

An analysis of variance is then conducted on each set of
factor scores using multiple group memberships as independent
variables and factor scores for each vector as the dependent
variable. Item bias is defined as a major loading on a factor
with a significant F ratio on a main effect or on an interaction.

Distractor Response Analysis

Veale and Foreman (1975) recommend investigating the
distractor response distribution for various cultural groups in
an approach not dependent on total test scores. Should one
group be overly attracted to a particular distractor in
comparison to a second group, there may be a biasing characteristic of
the item attracting them away from the correct response. Bias
is defined as characteristics of an item which cause a
distortion in the item p-value for a cultural group.

Consider the choice distribution illustrated in Table 1.
Observed frequencies appear in the cells, with expected
frequencies in parentheses. A disproportionate number of members
of Group 2 were attracted to Distractor 1 (the response
frequencies can be shown to be disproportionate by the use of a
chi-square test). It is argued that some characteristic of
Distractor 1 caused a substantial number of members of Group 2
to select this distractor over the correct alternative. Hence,
some characteristics of the item may have caused a distortion in
the group p-value.

                              Table 1

        A Hypothetical Item Distractor Choice Distribution

                        Frequency of Selection
                 Distractor 1    Distractor 2    Total

    Group 1         40 (60)         60 (40)       100
    Group 2         80 (60)         20 (40)       100

    Total             120              80         200
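The chi-square test mentioned for Table 1 can be worked through with a short sketch using the observed frequencies 40, 60, 80, and 20. The function name is hypothetical; the statistic is the ordinary Pearson chi-square for a contingency table.

```python
def chi_square(table):
    """Expected frequencies and Pearson chi-square for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return expected, stat

# Observed frequencies: rows are Group 1 and Group 2,
# columns are Distractor 1 and Distractor 2.
observed = [[40, 60],
            [80, 20]]
expected, stat = chi_square(observed)
print(expected, round(stat, 2))
```

The expected frequencies come out as 60 and 40 in each row, and the statistic is about 33.3 on one degree of freedom, well beyond the conventional .05 critical value of 3.84.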





Maw (1977) has developed an approach based on the work of
Ku and Kullback (1974). For each item, a contingency table is
developed which includes the item distractors and the culture
groups, as is done by Veale and Foreman. However, Maw includes
additional variables which are known correlates of educational
achievement, e.g., home background and attitudes, instructional
processes, and socioeconomic status. These known correlates
are expected to account for most of the item variance. Various
log-linear models are fitted to the data until the data are
adequately represented. The parameters of the model are then
investigated for information about the distractor patterns.
Biased items are identified by a significant distractor-by-
culture group marginal effect. The individual parameters of
the marginal are then analyzed to determine the contributions
of the various item response choices.

Item Characteristic Curve Theory Approaches

Recently, latent trait theory has been used to identify
biased items (Green and Draper, 1972; Lord, 1977; Rudner, 1977a;
Pine, 1976; Scheuneman, 1975, 1976; Durovic, 1975; Wright,
Mead and Draba, 1976). In an early study, Green and Draper had
used observed total scores as estimates of examinees' abilities
($\theta$'s) and the proportions of examinees responding correctly at
each total score level as estimates of the probability of a
correct response given $\theta_i$ [$P(u = 1 \mid \theta_i)$]. Their procedure
called for plotting estimated item characteristic curves (icc's)
for each item, separately for each culture group, and comparing
the plots.

By this and other latent trait theory approaches, an item
is unbiased if examinees of the same ability level, but of
different cultural affiliations, have equal probabilities of
responding correctly. That is, an item is unbiased if the
estimated icc's obtained from the various culture groups are
identical. As an example of a biased item, consider the two
hypothetical curves shown in Figure 1. These curves are based on
responses by two different culture groups to the same item.
Total observed scores are used as estimates of $\theta_i$, and proportions
of examinees responding correctly are used as estimates of
$P(u = 1 \mid \theta_i)$. The curves are not identical, since the
location parameters for the two curves are not equal. Such an item
can be considered biased in that often examinees of the same
ability level, e.g., $X_i$ = 59%, but from different culture
groups, do not have similar proportions of correct responses.

[Figure 1: Two hypothetical response distributions. Curves for
Group A and Group B; horizontal axis: observed total score,
0% to 100%.]

While this approach is appealing, total observed scores
are directly incorporated and quantification of the degree of
item bias is difficult (an eyeballing procedure is used to
identify a "very biased item").

Rather than using total observed scores as estimates of θ and
proportions as estimates for P(u = 1 | θ), more accurate values can
be obtained using one of the recent methods of parameterization
(Urry, 1975; Wingersky and Lord, 1973). During parameterization, the
metric used for the θ scale is defined by the ability variance in the
examined sample. In order to compare parameters obtained from two
different examinee groups, the obtained values must be equated. For
the three parameter model, Lord and Novick (1974, Chapter 16.11) and
Rudner (1977b) have shown that this can be accomplished by computing
the regressions of the parameter values based on one group of
examinees on the parameter values based on the other group of
examinees.
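As a rough illustration of equating by regression, the sketch below fits an ordinary least-squares line relating item difficulty estimates from one group to those from another, then uses the line to re-express the first group's values on the second group's metric. The function names, the use of difficulty values only, the direction of the regression, and the sample numbers are assumptions made for the example; the cited sources should be consulted for the actual procedure.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def equate(values, slope, intercept):
    """Re-express values on the other group's scale."""
    return [slope * v + intercept for v in values]


# Hypothetical difficulty estimates for the same four items,
# calibrated separately in two examinee groups.
b_group_1 = [-1.0, 0.0, 1.0, 2.0]
b_group_2 = [-1.5, 0.5, 2.5, 4.5]

slope, intercept = fit_line(b_group_1, b_group_2)
b_group_1_equated = equate(b_group_1, slope, intercept)
```

After equating, an item whose value still differs markedly between the groups stands out from the line; how such departures are summarized differs from author to author.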

Rudner (1977a), Lord (1977), and Pine (1976) have refined the
procedure used by Green and Draper to identify biased items by
incorporating equated icc parameter values for the 3 parameter
Birnbaum (1968) model. Rudner used the area between pairs of equated
icc's to indicate the relative amount of aberrance of each item, and
eyeballing of the equated icc's to provide additional information as
to the nature of the aberrance. Lord has employed an asymptotic
significance test based on the summed variance-covariance matrices of
the equated parameter estimates to test for significant differences
between pairs of equated icc's. Pine uses the residuals from equating
the difficulty and discrimination parameters as an index of aberrance.
Using the one parameter Rasch model, Durovic (1975) and Wright, Mead
and Draba (1976) focus their attention on differences in the relative
easiness of the items. The differences between observed item
responses and the predicted probabilities of a correct response are
computed. The goodness-of-fit residual is then analyzed for
between-group differences.
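The area index can be illustrated with a small numerical sketch. It assumes the logistic form of the three parameter model with the customary scaling constant 1.7, a fixed ability range of -4 to 4, and trapezoidal integration; none of these choices is specified in the text, and the names `icc_3pl` and `area_between` are hypothetical.

```python
import math


def icc_3pl(theta, a, b, c, d=1.7):
    """Three parameter logistic icc: discrimination a, difficulty b, guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def area_between(params_1, params_2, lo=-4.0, hi=4.0, steps=800):
    """Trapezoidal estimate of the unsigned area between two equated icc's.

    params_1, params_2: (a, b, c) tuples already on a common scale.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        gap = abs(icc_3pl(theta, *params_1) - icc_3pl(theta, *params_2))
        weight = 0.5 if k in (0, steps) else 1.0  # end points count half
        total += weight * gap
    return total * h


# Hypothetical item: same discrimination and guessing in both groups,
# but one unit harder for the second group.
area = area_between((1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
```

Identical curves give an area of zero, and larger values indicate greater aberrance. Because the area is unsigned, it does not show which group is favored, which is one reason a visual check of the curves is still useful.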

Scheuneman (1975, 1976) has developed a technique which is similar
to the multi-parameter item characteristic curve theory approach used
by Green and Draper. Coined the chi-square approach, this approach
seeks to determine whether examinees of the same ability level have
the same probability of a correct response regardless of cultural
affiliation. This is accomplished by blocking each tryout sample into
3 to 5 groups based on the observed scores and comparing the
proportions of students within each level responding correctly. An
item is considered unbiased if, for all individuals in the same total
score interval, the proportion of correct responses is the same for
both groups under consideration.

A modified chi-square is used to estimate the probability that the
item is unbiased by the above definition. The expected values for
each cell ($E_{ij}$) are obtained by multiplying (1) the proportion of
all examinees with total scores within interval j responding
correctly to the item by (2) the number of examinees within the cell.
That is,

    $$E_{ij} = \frac{O_{\cdot j}}{N_{\cdot j}} \, N_{ij}$$

where $O_{\cdot j}$ is the number of examinees in total score
      interval j responding correctly,
      $N_{\cdot j}$ is the total number of examinees in total score
      interval j, and
      $N_{ij}$ is the total number of examinees in Group i and score
      interval j.

As with a conventional chi-square, observed cell values are simply the
number of examinees within the cell responding correctly to the item.
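The expected-value computation can be written out as a short sketch. The text defines only the expected and observed cell values, so summing (observed - expected)^2 / expected over the cells, as in a conventional Pearson chi-square, is an assumption here, as are the function names and the example counts.

```python
def expected_correct(correct, totals):
    """Expected number correct for each cell.

    correct[i][j]: examinees in group i, score interval j, answering correctly.
    totals[i][j]:  all examinees in group i, score interval j.
    """
    n_groups = len(correct)
    n_intervals = len(correct[0])
    expected = []
    for i in range(n_groups):
        row = []
        for j in range(n_intervals):
            o_dot_j = sum(correct[g][j] for g in range(n_groups))
            n_dot_j = sum(totals[g][j] for g in range(n_groups))
            # pooled proportion correct in interval j, times the cell size
            row.append(o_dot_j / n_dot_j * totals[i][j])
        expected.append(row)
    return expected


def chi_square(correct, expected):
    """Sum of (O - E)^2 / E over all cells (conventional form, assumed)."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(correct, expected)
        for o, e in zip(o_row, e_row)
    )


# Hypothetical counts: two groups, three total-score intervals.
correct = [[10, 30, 45], [5, 20, 40]]
totals = [[40, 50, 50], [20, 50, 50]]

expected = expected_correct(correct, totals)
statistic = chi_square(correct, expected)
```

In this made-up example the lowest interval contributes nothing to the statistic, because both groups answer correctly at the same rate (25%) there; the contribution comes from the two higher intervals.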

Summary

Selected methods for examining the test performance of members of
identifiable groups for fairness were presented. Two sets of
methodologies were identified: one in which an intact test is
administered to members of different groups to provide data for
selection; the other, in which items from a pool are examined for
systematic differentiation among groups. The purpose of this paper
was simply to describe the methods. No attempt to evaluate them was
made.






References 






Anastasi, A. Psychological Testing (4th ed.). New York: Macmillan,
1976.

Angoff, W. H. A technique for the investigation of cultural
differences. Paper presented at the annual meeting of the American
Psychological Association, Honolulu, May 1972.

Angoff, W. H., & Ford, S. F. Item-race interaction on a test of
scholastic aptitude. Journal of Educational Measurement, 1973, 10,
95-105.

Angoff, W. H., & Modu, C. C. Equating the scales of the Prueba de
Aptitud Academica and the Scholastic Aptitude Test. New York: College
Entrance Examination Board, 1973.

Cardall, C., & Coffman, W. E. A method for comparing the performance
of different groups on the items in a test (RM 64-61). Princeton:
Educational Testing Service, 1964.

Birnbaum, A. Some latent trait models and their use in inferring an
examinee's ability. In F. M. Lord & M. R. Novick, Statistical
Theories of Mental Test Scores. Reading, MA: Addison-Wesley, 1968,
Chapters 17-20.

Cleary, T. A. Test bias: Prediction of grades of Negro and white
students in integrated colleges. Journal of Educational Measurement,
1968, 5, 115-124.

Cleary, T. A., & Hilton, T. L. An investigation into item bias.
Educational and Psychological Measurement, 1968, 28, 61-75.






Cole, N. S. Bias in selection. Journal of Educational Measurement,
1973, 10, 237-255.

Darlington, R. B. Another look at "cultural fairness." Journal of
Educational Measurement, 1971, 8, 71-82.

Durovic, J. Definitions of test bias: A taxonomy and an illustration
of an alternative model. Unpublished doctoral dissertation, State
University of New York at Albany, 1975.

Eagle, N., & Harris, A. S. Interaction of race and test on reading
performance scores. Journal of Educational Measurement, 1969, 6,
131-135.

Echternacht, G. A quick method for determining test bias. Educational
and Psychological Measurement, 1974, 34, 271-280.

Einhorn, H. J., & Bass, A. R. Methodological considerations relevant
to discrimination in employment testing. Psychological Bulletin,
1971, 75, 261-269.

Green, D. R. Reducing bias in achievement tests. Paper presented at
the annual meeting of the American Educational Research Association,
San Francisco, April 1976.

Green, D. R., & Draper, J. F. Exploratory Studies of Bias in
Achievement Tests. Monterey: CTB/McGraw-Hill, 1972.

Gross, A. L., & Su, W. Defining a "fair" or "unbiased" selection
model: A question of utilities. Journal of Applied Psychology, 1975,
60, 345-351.

Guilford, J. P. Psychometric Methods. New York: McGraw-Hill, 1954.

Hicks, M. M., Donlon, T. F., & Wallmark, M. M. Sex differences in
item responses on the Graduate Record Examination. Paper presented at
the annual meeting of the National Council on Measurement in
Education, San Francisco, April 1976.

Hoepfner, R., & Strickland, G. P. Investigating Test Bias. Los
Angeles: Center for the Study of Evaluation, University of California,
1972.

Jensen, A. R. An examination of culture bias in the Wonderlic
Personnel Test. Arlington, VA: ERIC Clearinghouse, 1973 (ERIC
Document Reproduction Service ED 086 726).

Ku, H. H., & Kullback, S. Loglinear models in contingency table
analysis. The American Statistician, 1974, 28, 115-122.

Linn, R. L. Fair test use in selection. Review of Educational
Research, 1973, 43, 139-161.






Lord, F. M. A study of item bias using item characteristic curve
theory. Proceedings of the Third Congress of Cross Cultural
Psychology, Tilburg, Holland, 1977.

Lord, F. M., & Novick, M. R. Statistical Theories of Mental Test
Scores (2nd ed.). Reading, MA: Addison-Wesley, 1974.

Maw, C. Item bias and information in item responses. Paper presented
at the Psychometric Society Meeting, June 1977.

Merz, W. R. Estimating bias in test items utilizing principal
components analysis and the general linear solution. Paper presented
at the annual meeting of the American Educational Research
Association, San Francisco, April 1976.

Merz, W. R. Test fairness and test bias: A review of procedures.
Paper presented at the USOE Invitational Conference on Achievement
Testing of Disadvantaged and Minority Students for Program Evaluation,
Reston, VA, May 1976.

Ozenne, D. G., Van Gelder, N. C., & Cohen, A. J. Emergency School Aid
Act (ESAA) National Evaluation: Achievement Test Restandardization.
Santa Monica, CA: Systems Development Corporation, 1974.

Petersen, N. S., & Novick, M. R. An evaluation of some models for
culture-fair selection. Journal of Educational Measurement, 1976, 13,
3-29.

Pine, S. M. Applications of item characteristic curve theory to the
problems of test bias. In D. J. Weiss (Ed.), Applications of
Computerized Adaptive Testing (RR 77-1). Minneapolis: University of
Minnesota Psychometric Methods Program, March 1977.

Plake, B. S., & Hoover, H. D. An analytical method of identifying
biased test items. Paper presented at the annual meeting of the
American Educational Research Association, New York, April 1977.

Rudner, L. M. An approach to biased item identification using latent
trait measurement theory. Paper presented at the annual meeting of
the American Educational Research Association, New York, April 1977a.

Rudner, L. M. A closer look at latent trait parameter invariance.
Paper presented at the annual meeting of the New England Educational
Research Organization, Manchester, NH, May 1977b.

Rudner, L. M. An evaluation of select approaches for biased item
identification. Unpublished doctoral dissertation, The Catholic
University of America, 1977c.






Rudner, L. M. Using standard tests with the hearing impaired: The
problem of item bias. Volta Review, 1978, 80, 31-40.

Scheuneman, J. A new method of assessing bias in test items. Paper
presented at the annual meeting of the American Educational Research
Association, Washington, D.C., April 1975.

Scheuneman, J. A procedure for evaluating item bias in the absence of
an outside criterion. Paper presented at the annual meeting of the
American Educational Research Association, April 1976.

Strassberg-Rosenberg, B., & Donlon, T. F. Context influences on sex
differences in performance and aptitude tests. Paper presented at the
annual meeting of the National Council on Measurement in Education,
Washington, D.C., 1975.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational
Measurement, 1971, 8, 63-70.

Thurstone, L. L. A method of scaling psychological and educational
tests. Journal of Educational Psychology, 1925, 16, 433-451.

Tucker, L. R. An interbattery method of factor analysis.
Psychometrika, 1958, 23, 111-136.

Urry, V. W. Ancillary estimators for the parameters of mental test
models. Paper presented at the American Psychological Association
Convention, Chicago, August 1975.

Veale, J. R., & Foreman, D. I. Cultural validity of items and tests:
A new approach. Score Technical Report, Iowa City, Iowa: Westinghouse
Learning Corporation/Measurement Research Center, 1975.






CALIFORNIA STATE UNIVERSITY, SACRAMENTO
PROGRAM DOCUMENTATION

USER MESSAGES

Program No.: RGS121

The user may be Job Control or any campus agency. Explain the
necessary actions associated with all printed messages.

MESSAGE                            ACTION
---------------------------------  ---------------------------------
XXXX MEETS AFTER 10 PM             WARNING TO ACADEMIC ADMIN.

EXAMPLE:                           SHOULD BE CHECKED OUT BY ACAD.
XXXX ANTH 010 XXXX ANTH            ADMIN. CORRECT IT IF NOT
[illegible] SAME CLASS             REALLY TRUE.

THERE IS A TIME CONFLICT           THE SAME INSTRUCTOR IS LISTED
BETWEEN XXXX AND XXXX.             AS TEACHING 2 CLASSES AT ONCE.
                                   CORRECT THROUGH RGS127.

*V IS AN INVALID DAY OF THE        CORRECT THROUGH RGS127.
WEEK COURSE XXXX.



KEYPUNCH INSTRUCTIONS

Procedure:             RGS121 - Input
Operation No.:
Procedure:
Source Document:       ROOM RESERVATIONS
Document Source:
Card Form and Number:

Field  Card Cols.   Alpha/Numeric  No. of Cols.  [illegible]  Description
-----  -----------  -------------  ------------  -----------  ---------------------------
1      1-8          -              8                          Blank
2      9-13         A              5             V            Days of the week (required)
3      14-17        N              4             V            Starting time (required)
4      18-21        N              4             V            Ending time (required)
5      22-23        N              2             V            Building code (required)
6      24-28        A/N            5             V            Room number (required)
7      [illegible]  A              10            V            Instructor name
8      39           N              1             V            School code (required)
9      [illegible]  A              5             V            Department name (required)
10     45-49        A/N            5             V            Course number
11     50-80                       31            V            Blank
12-17  (not used)

Documents to Next Operation:  [ ]
Cards to Next Operation:      [ ]
Other:
Other:

Prepared by: Clyde M. King
Date: [illegible]/74
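
For readers who want to see how the card layout above maps onto a record, here is a minimal sketch that slices an 80-column card image into the fields whose column positions are legible. The field names, the `parse_card` and `meets_after_10pm` helpers, and the assumption that times are punched as four-digit 24-hour values (HHMM) are all hypothetical; the form itself does not state the time format or how RGS121 performs its checks.

```python
# 1-based, inclusive column ranges taken from the legible rows of the form.
FIELDS = {
    "days": (9, 13),
    "start_time": (14, 17),
    "end_time": (18, 21),
    "building_code": (22, 23),
    "room_number": (24, 28),
    "school_code": (39, 39),
    "course_number": (45, 49),
}


def parse_card(card):
    """Split one card image into named fields, trimming padding blanks."""
    card = card.ljust(80)[:80]  # normalize to exactly 80 columns
    return {
        name: card[first - 1:last].strip()
        for name, (first, last) in FIELDS.items()
    }


def meets_after_10pm(record):
    """Assumed check behind the 'MEETS AFTER 10 PM' warning (HHMM times)."""
    return int(record["end_time"]) > 2200


# Hypothetical card: blanks, days, start, end, building, room,
# ten columns left blank, school code, five columns left blank, course.
card = (" " * 8 + "MWF  " + "0900" + "0950" + "12" + "101  "
        + " " * 10 + "3" + "ANTH " + "010  ")
record = parse_card(card)
```

Columns whose positions are illegible on the form are deliberately left out of `FIELDS` rather than guessed.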



